Experimental benchmarking

Experimental benchmarking allows researchers to assess the accuracy of non-experimental research designs by comparing observational results to experimental findings. Under ordinary conditions, carrying out an experiment gives researchers an unbiased estimate of their parameter of interest. That estimate then serves as a benchmark: the gap between it and the estimate from observational research on the same question indicates the bias of the observational design. Benchmarking is thus an attempt to calibrate non-statistical uncertainty, that is, error arising from flaws in underlying assumptions rather than from sampling variability. When combined with meta-analysis, the method can be used to understand the scope of bias associated with a specific area of research.
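The comparison can be illustrated with a small simulation. The following Python sketch is not drawn from any of the cited studies: the data-generating process, the true effect of 2.0, and the function names `simulate` and `diff_in_means` are all assumptions made for illustration. Subjects share a background trait that raises the outcome; in the observational sample the same trait also makes treatment more likely, so the naive comparison is biased, while the randomized sample supplies the benchmark.

```python
import random
from statistics import mean


def diff_in_means(rows):
    """Difference in mean outcome between treated and untreated subjects."""
    treated = [y for t, y in rows if t == 1]
    control = [y for t, y in rows if t == 0]
    return mean(treated) - mean(control)


def simulate(n, randomized, true_effect=2.0, seed=0):
    """Hypothetical data: a background trait u raises the outcome.

    If randomized, treatment is a coin flip. Otherwise subjects with
    high u are more likely to select into treatment (assumed mechanism).
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        u = rng.gauss(0, 1)
        if randomized:
            p_treat = 0.5
        else:
            p_treat = 0.8 if u > 0 else 0.2
        t = 1 if rng.random() < p_treat else 0
        y = true_effect * t + 3.0 * u + rng.gauss(0, 1)
        rows.append((t, y))
    return rows


# Experimental benchmark versus naive observational estimate.
experimental = diff_in_means(simulate(20000, randomized=True, seed=1))
observational = diff_in_means(simulate(20000, randomized=False, seed=2))

# The gap between the two is the benchmarked bias of the observational design.
estimated_bias = observational - experimental
print(f"experimental={experimental:.2f} "
      f"observational={observational:.2f} bias={estimated_bias:.2f}")
```

In a real benchmarking study the true effect is unknown; the experimental estimate stands in for it, which is why the two samples must concern the same outcome and population.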

History

The start of experimental benchmarking in social science is often attributed to Robert LaLonde. In 1986 he showed that econometric procedures for assessing the effect of an employment program on trainee earnings did not recover the experimental findings.^[1]

Experimental benchmarking is also often conducted in medical research; examples include Schnell‐Inderst et al. (2017)^[2] and Burden et al. (2017).^[3]

Procedural Considerations

The most instructive experimental benchmarking designs are conducted on a large scale and compare experimental and non-experimental work that examines the same outcome in the same population.^[4]

Observational Designs That Can Be Assessed with Benchmarking

Non-experimental, or observational, research designs compare treated to untreated subjects while controlling for background attributes (called covariates). This estimation approach is also called covariate adjustment. Covariates are attributes that exist prior to treatment and therefore do not change based on treatment;^[5] examples include age, gender, weight, and hair color. If researchers are interested in the effect of smoking cessation classes on the number of cigarettes smoked a day, for instance, they may carry out covariate adjustment to control for ethnicity, income, and the number of years the smoker has been smoking.
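A minimal sketch of covariate adjustment for the smoking cessation example follows, written in Python with simulated data. Every number in it is an assumption made for illustration: the true effect of attending a class is set to 3 fewer cigarettes a day, long-term smokers are assumed both to smoke more and to attend less often, and only one covariate (years of smoking) is used. The adjustment is an ordinary least-squares regression of the outcome on treatment and the covariate, computed by residualizing both on the covariate (the Frisch-Waugh-Lovell approach); the helper names are hypothetical.

```python
import random
from statistics import mean


def fit_line(x, y):
    """Return (slope, intercept) of the least-squares line of y on x."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def residuals(x, y):
    """Part of y that a straight line in x does not explain."""
    slope, intercept = fit_line(x, y)
    return [b - (intercept + slope * a) for a, b in zip(x, y)]


def naive_effect(treated, outcome):
    """Unadjusted difference in means between treated and untreated."""
    y1 = [y for t, y in zip(treated, outcome) if t == 1]
    y0 = [y for t, y in zip(treated, outcome) if t == 0]
    return mean(y1) - mean(y0)


def adjusted_effect(treated, covariate, outcome):
    """Coefficient on treatment in the regression outcome ~ treated + covariate."""
    t_res = residuals(covariate, treated)
    y_res = residuals(covariate, outcome)
    return fit_line(t_res, y_res)[0]


# Simulated, hypothetical data (not from any cited study).
rng = random.Random(42)
years, attended, cigarettes = [], [], []
for _ in range(5000):
    x = rng.uniform(1, 30)                          # years of smoking
    t = 1 if rng.random() < 0.7 - 0.02 * x else 0   # attended a class
    y = 5 + 0.5 * x - 3.0 * t + rng.gauss(0, 2)     # cigarettes per day
    years.append(x)
    attended.append(t)
    cigarettes.append(y)

naive = naive_effect(attended, cigarettes)
adjusted = adjusted_effect(attended, years, cigarettes)
print(f"naive={naive:.2f} adjusted={adjusted:.2f}")
```

The adjustment works here only because the simulation lets the single observed covariate drive selection into treatment; whether that holds in real data is exactly what benchmarking is meant to test.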

Covariate adjustment can be carried out in a variety of ways. Gordon et al. (2018) illustrate many of these methods using online advertising data, including propensity score matching, stratification, regression adjustment, and inverse probability weighted regression adjustment. They find that, despite great variation in the variables within their data, observational methods cannot recover the causal effects of online advertising. The study ultimately provides evidence that, without a randomized controlled trial, it is impossible to detect symptoms of bias, since the bias does not always run in the same direction or have the same magnitude.
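Two of these methods, stratification and inverse probability weighting, can be sketched in Python on simulated data. This is an illustration under stated assumptions, not the procedure of Gordon et al.: the three-level observed covariate `x`, the unobserved trait `u`, the true effect of 1.0, and all function names are hypothetical. The `hidden` parameter controls whether the unobserved trait influences both treatment and outcome. With a single discrete covariate and propensities estimated as the treated share in each stratum, the two estimators coincide.

```python
import random
from collections import defaultdict
from statistics import mean


def simulate(n, hidden, true_effect=1.0, seed=0):
    """Hypothetical data with observed covariate x and unobserved trait u."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.randrange(3)   # observed covariate with levels 0, 1, 2
        u = rng.randrange(2)   # unobserved trait, 0 or 1
        p_treat = 0.2 + 0.2 * x + hidden * 0.3 * u
        t = 1 if rng.random() < p_treat else 0
        y = true_effect * t + 2.0 * x + hidden * 2.0 * u + rng.gauss(0, 1)
        rows.append((x, t, y))
    return rows


def group_by_stratum(rows):
    strata = defaultdict(list)
    for x, t, y in rows:
        strata[x].append((t, y))
    return strata


def stratified_estimate(rows):
    """Within-stratum differences in means, weighted by stratum share."""
    total = 0.0
    for members in group_by_stratum(rows).values():
        treated = [y for t, y in members if t == 1]
        control = [y for t, y in members if t == 0]
        total += len(members) / len(rows) * (mean(treated) - mean(control))
    return total


def ipw_estimate(rows):
    """Inverse probability weighting with per-stratum estimated propensities."""
    propensity = {
        x: mean(t for t, _ in members)
        for x, members in group_by_stratum(rows).items()
    }
    return mean(
        t * y / propensity[x] - (1 - t) * y / (1 - propensity[x])
        for x, t, y in rows
    )


# Selection on the observed covariate only, then with a hidden confounder.
clean = simulate(30000, hidden=0.0, seed=1)
confounded = simulate(30000, hidden=1.0, seed=2)

for label, data in (("no hidden confounder", clean),
                    ("hidden confounder", confounded)):
    print(f"{label}: stratified={stratified_estimate(data):.2f} "
          f"ipw={ipw_estimate(data):.2f}")
```

Nothing in the confounded sample itself signals that its estimates are off; in this sketch the problem is visible only because the true effect is known, which is the role an experimental benchmark plays in practice.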

Selected Examples of Experimental Benchmarking

Bloom et al. (2002) study the impact of mandatory welfare-to-work programs to ask which non-experimental methods come closest to recovering the experimentally estimated effects of such programs. They also ask whether the most accurate non-experimental methods are accurate enough to take the place of experimental work. They ultimately argue that none of the methods approaches the accuracy of experimental methods for recovering the parameter of interest.^[6]

Dehejia and Wahba (1999) re-examine LaLonde's (1986) data with additional non-experimental methods. They argue that when there is enough overlap between the subject pools and unobservable covariates do not affect outcomes, non-experimental methods can indeed estimate treatment impact accurately.^[7]

Glazerman, Levy and Myers (2003) perform experimental benchmarking in the context of employment services, welfare, and job training. They determine that non-experimental methods may approximate experimental estimates, but that these estimates can be biased enough to affect policy analysis and implementation.^[8]

Gordon et al. (2018) use data from Facebook to test whether the variation in data collected by the advertising industry allows observational methods to recover the causal effects of online advertising. Specifically, the study analyzes the effectiveness of Facebook ads on three outcomes: checkout, registration, and page view. They find that, despite the great variation made possible by the nature of social media, it is not possible to accurately recover the causal effects.^[9]

References


[1] LaLonde, Robert (1986). "Evaluating the Econometric Evaluations of Training Programs with Experimental Data". American Economic Review. 76 (4): 604–620.
[2] Schnell‐Inderst, P., Iglesias, C. P., Arvandi, M., Ciani, O., Matteucci Gothe, R., Peters, J., ... & Siebert, U. (2017). A bias‐adjusted evidence synthesis of RCT and observational data: the case of total hip replacement. Health Economics, 26, 46-69.
[3] Burden, A., Roche, N., Miglio, C., Hillyer, E. V., Postma, D. S., Herings, R. M., Overbeek, J. A., Khalid, J. M., van Eickels, D., … Price, D. B. (2017). An evaluation of exact matching and propensity score methods as applied in a comparative effectiveness study of inhaled corticosteroids in asthma. Pragmatic and Observational Research, 8, 15-30. doi:10.2147/POR.S122563
[4] Smith, Jeffrey, and Petra Todd. 2001. "Reconciling Conflicting Evidence on the Performance of Matching Methods?" American Economic Review, Papers and Proceedings 91(2): 112-118.
[5] "10 Things to Know About Covariate Adjustment". Evidence in Government and Politics. EGAP. Retrieved 26 November 2018.
[6] Bloom, H. S., Michalopoulos, C., Hill, C. J., & Lei, Y. (2002). Can Nonexperimental Comparison Group Methods Match the Findings from a Random Assignment Evaluation of Mandatory Welfare-to-Work Programs? MDRC Working Papers on Research Methodology.
[7] Dehejia, R. H., & Wahba, S. (1999). Causal effects in nonexperimental studies: Reevaluating the evaluation of training programs. Journal of the American Statistical Association, 94(448), 1053-1062.
[8] Glazerman, S., Levy, D. M., & Myers, D. (2003). Nonexperimental versus experimental estimates of earnings impacts. The Annals of the American Academy of Political and Social Science, 589(1), 63-93.
[9] Gordon, Brett R., Florian Zettelmeyer, Neha Bhargava, and Dan Chapsky. 2018. A Comparison of Approaches to Advertising Measurement: Evidence from Big Field Experiments at Facebook. papers.ssrn.com/sol3/papers.cfm?abstract_id=3033144
